Introduction

Today we’ll be looking at the “Human Origins” (HO) Dataset, whichis a publically accessible dataset and can be found here: https://reich.hms.harvard.edu/datasets.

This document is a tutorial in reading and analysing some data we have generated for you beforehand (because some of the analyses take longer than we have time for and this way you only need to worry about one programming langauge - R). You will be reading in data, plotting it out in R - and you’re encouraged throughout this document to go “off trail” and statistically test your own questions.

Section 1 - Set up

Packages

First of all, let’s set up our workspace, by specifying some packages you will need:

library("tidyverse")
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.3     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library("mclust")
## Package 'mclust' version 5.4.7
## Type 'citation("mclust")' for citing this R package in publications.
## 
## Attaching package: 'mclust'
## The following object is masked from 'package:purrr':
## 
##     map

If you don’t have them installed in R, you can install with:

install.packages(c("tidyverse","mclust"))

Data Locations

We also want to define the location of the directory which we’ve saved all of this data together. We recommend downloading this to a temp folder on your machine.

datadir <- "D:/temp_data/crt_data/"

Sample Data

We want to now read in our sample and population labels. This is to aid plotting the genetic results later as the HO dataset has a lot of individual populations with small sample sizes which makes plotting “interesting”.

Sample Labels

Note that we’ve taken a subset of the HO dataset, so there are lots more individuals listed here than you’ll be working with today. We’ve decided to focus on invididuals from populations around western Eurasia.

id_data <-paste0(datadir, "human-origins_ind-labels.txt")
id_data <- read_delim(file = id_data, delim = "\t", col_names = T, show_col_types = FALSE) %>%
  select(IID,LABEL)

id_data
## # A tibble: 2,068 x 2
##    IID     LABEL  
##    <chr>   <chr>  
##  1 SA1004  Khomani
##  2 SA063   Khomani
##  3 SA010   Khomani
##  4 SA064   Khomani
##  5 SA073   Khomani
##  6 SA1025  Khomani
##  7 SA078   Khomani
##  8 SA083   Khomani
##  9 Yuk_009 Yukagir
## 10 Yuk_025 Yukagir
## # ... with 2,058 more rows

Population Labels

Now we want to read in the “pop data”, which contains plotting information on the populations we’ve included in this analysis. here we can define plotting colours, shapes, the order in which we plot populations, as well as what “region” (group of populations) they’re from.

pop_data <-paste0(datadir, "human-origins.pop-labels.csv")
pop_data <- read_delim(file = pop_data, delim = ",", col_names = T, show_col_types = FALSE)

pop_data
## # A tibble: 64 x 4
##    REGION LABEL           COLOUR  SHAPE
##    <chr>  <chr>           <chr>   <dbl>
##  1 Caucus Abkhasian       darkred     1
##  2 Caucus Adygei          darkred     2
##  3 Caucus Balkar          darkred     3
##  4 Caucus Chechen         darkred     4
##  5 Caucus Georgian        darkred     5
##  6 Caucus Iranian         darkred     6
##  7 Caucus Iranian_Bandari darkred     7
##  8 Caucus Kumyk           darkred     8
##  9 Caucus Lezgin          darkred     9
## 10 Caucus Nogai           darkred    10
## # ... with 54 more rows

Principal Component Analysis

Let’s look at the PCA results that you have generated on the server. We’ve used the program PLINK for a lot of this, and it’s worth having a look around on that website for other analyses that you could do.

We’ll read in the data, then combine the individual results with population labels, and then plot using the ggplot package-set.

Principal Component Analysis (PCA) is summarising the genetic relationships (distance) between the HO individuals and every other individual in the dataset. The first two principal components (PCs) summarise the top first and second sources of variaiton in the dataset - and are often the preferential components to plot to begin with as they exmaplin the most variation, and are easy to visualise on a screen.

# read in data
vec_data_pruned <- paste0(datadir, "ho-westeurasia.pruned-pca.eigenvec")
vec_data_pruned <- read_delim(file = vec_data_pruned, delim = " ", 
                              col_names = c("FID","IID", paste0("PC",1:20)), show_col_types = FALSE)

# combine PC data with population labels
vec_data_pruned <- vec_data_pruned %>%
  left_join(., id_data, by = "IID") %>%
  left_join(., select(pop_data,LABEL,REGION), by = "LABEL") %>%
  # order the population labels according to the order in the pop_data df
  # by using factors (ordered variables)
  mutate(REGION = factor(REGION, levels = pop_data %>% distinct(REGION) %>% pull(REGION)),
         LABEL = factor(LABEL, levels = pop_data %>% distinct(LABEL) %>% pull(LABEL)))

vec_data_pruned
## # A tibble: 901 x 24
##    FID    IID           PC1    PC2       PC3      PC4      PC5      PC6      PC7
##    <chr>  <chr>       <dbl>  <dbl>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1 French HGDP00511 -0.0330 0.0175  0.00124  -0.0121  -3.15e-3  3.22e-3  0.00918
##  2 French HGDP00512 -0.0275 0.0182 -0.00934  -0.0120  -3.01e-3  2.01e-3  0.00132
##  3 French HGDP00513 -0.0285 0.0166 -0.00272  -0.0114  -8.69e-3 -3.92e-4 -0.00672
##  4 French HGDP00514 -0.0343 0.0137 -0.000342 -0.0141   9.73e-4  4.81e-3  0.00389
##  5 French HGDP00515 -0.0340 0.0187 -0.00548  -0.0110   1.36e-3 -3.40e-4  0.00720
##  6 French HGDP00516 -0.0291 0.0115 -0.00974  -0.0101  -3.72e-3 -2.37e-3  0.00581
##  7 French HGDP00517 -0.0297 0.0219 -0.00361  -0.00830 -4.04e-4  2.90e-3  0.00734
##  8 French HGDP00518 -0.0318 0.0172 -0.00485  -0.0141  -7.48e-3  4.19e-4  0.00520
##  9 French HGDP00519 -0.0376 0.0126 -0.00284  -0.0102   3.74e-3  4.95e-3  0.00364
## 10 French HGDP00522 -0.0315 0.0168 -0.0107   -0.00982 -3.76e-3 -1.79e-3  0.00863
## # ... with 891 more rows, and 15 more variables: PC8 <dbl>, PC9 <dbl>,
## #   PC10 <dbl>, PC11 <dbl>, PC12 <dbl>, PC13 <dbl>, PC14 <dbl>, PC15 <dbl>,
## #   PC16 <dbl>, PC17 <dbl>, PC18 <dbl>, PC19 <dbl>, PC20 <dbl>, LABEL <fct>,
## #   REGION <fct>
# and plot
ggplot(data = vec_data_pruned, aes(x = PC1, y = -PC2, colour = LABEL, shape = LABEL)) +
  geom_point(size = 2) +
  scale_colour_manual(values = pull(pop_data, COLOUR), name = "Population") +
  scale_shape_manual(values = pull(pop_data, SHAPE), name = "Population") +
  guides(colour = guide_legend(ncol = 2), shape = guide_legend(ncol = 2)) +
  labs(x = "Principal Component 1", y = "Principal Component 2") +
  theme(panel.background = element_blank(),
        panel.border = element_rect(fill = NA, colour = "black"),
        panel.grid.major = element_line(colour = "grey60", linetype = "dashed"),
        panel.grid.minor = element_line(colour = "grey60", linetype = "dotted"),
        legend.key = element_rect(fill = NA, colour = "grey60"))

Looks like we have a genetic map of Europe and the Near East! Note how related populations are clustering together. You can also see different regions start to project towards each other, like Northern Europe and populations from the Caucasus Mountains. This is a nice visualisation of “population structure”, i.e., genetic variation in West Eurasia is stratified, hence the different “clusters” which coincide with populations.

Things to do:

ADMIXTURE Analysis

The next type of population structure detecting analyses which is common is estimate ancestry proportions from programs like STRUCTURE or ADMIXTURE. What we’ve used here is ADMIXTURE. These programs typically take genotype data from n individuals, and then model them as a mixture of k hypothetical populations/components in an “unsupervised” analysis.

That’s what we’ve done here, taking our west Eurasian individuals and modelling them as a variety of k values. For the below code we’ve assumed k=4, but you can easily change the kval variable.

# define the kval to read in
kval <- 4

# read in the .Q file, which is a file with lines = number of individuals, and cols = kval+2
adm_data <- paste0(datadir, "ho-westeurasia.admixture.K", kval, ".Q")
adm_data <- read_delim(file = adm_data, delim = " ",
                       col_names = c("LABEL","IID", paste0("ANC",1:kval)),
                       show_col_types = FALSE)
adm_data
## # A tibble: 901 x 6
##    LABEL  IID         ANC1    ANC2  ANC3  ANC4
##    <chr>  <chr>      <dbl>   <dbl> <dbl> <dbl>
##  1 French HGDP00511 0.0546 0.00557 0.385 0.554
##  2 French HGDP00512 0.0946 0.0273  0.312 0.566
##  3 French HGDP00513 0.0939 0.0240  0.329 0.553
##  4 French HGDP00514 0.0747 0.00001 0.382 0.543
##  5 French HGDP00515 0.0555 0.00567 0.362 0.577
##  6 French HGDP00516 0.130  0.00001 0.324 0.546
##  7 French HGDP00517 0.0519 0.0393  0.342 0.567
##  8 French HGDP00518 0.0726 0.00001 0.345 0.583
##  9 French HGDP00519 0.0563 0.00001 0.399 0.545
## 10 French HGDP00522 0.0752 0.00001 0.327 0.597
## # ... with 891 more rows
# change the label column to a factor so we can order the pops
adm_data <- adm_data %>%
  mutate(LABEL = factor(LABEL, levels = pop_data %>% distinct(LABEL) %>% pull(LABEL)))

# now change the df into the "long" format, rather than the "wide" format
# we need this for plotting
# (see pivot_longer for a description)
adm_data <- adm_data %>%
  pivot_longer(3:ncol(.), names_to = "ANC", values_to = "Prop")
adm_data
## # A tibble: 3,604 x 4
##    LABEL  IID       ANC      Prop
##    <fct>  <chr>     <chr>   <dbl>
##  1 French HGDP00511 ANC1  0.0546 
##  2 French HGDP00511 ANC2  0.00557
##  3 French HGDP00511 ANC3  0.385  
##  4 French HGDP00511 ANC4  0.554  
##  5 French HGDP00512 ANC1  0.0946 
##  6 French HGDP00512 ANC2  0.0273 
##  7 French HGDP00512 ANC3  0.312  
##  8 French HGDP00512 ANC4  0.566  
##  9 French HGDP00513 ANC1  0.0939 
## 10 French HGDP00513 ANC2  0.0240 
## # ... with 3,594 more rows
# summarise the individual ADMIXTURE results to population averages of the ancestry
# proportions
adm_pop_data <- adm_data %>%
  group_by(LABEL, ANC) %>%
  summarise(avProp = mean(Prop), .groups = "drop") %>%
  left_join(., select(pop_data, REGION,LABEL), by = "LABEL") %>%
  mutate(REGION = factor(REGION, levels = pop_data %>% distinct(REGION) %>% pull(REGION)))
adm_pop_data
## # A tibble: 256 x 4
##    LABEL     ANC     avProp REGION
##    <chr>     <chr>    <dbl> <fct> 
##  1 Abkhasian ANC1  0.779    Caucus
##  2 Abkhasian ANC2  0.00254  Caucus
##  3 Abkhasian ANC3  0.148    Caucus
##  4 Abkhasian ANC4  0.0701   Caucus
##  5 Adygei    ANC1  0.666    Caucus
##  6 Adygei    ANC2  0.00001  Caucus
##  7 Adygei    ANC3  0.281    Caucus
##  8 Adygei    ANC4  0.0537   Caucus
##  9 Balkar    ANC1  0.667    Caucus
## 10 Balkar    ANC2  0.000387 Caucus
## # ... with 246 more rows
# plot this with populations grouped by regions, along the y axis, and the proportions
# of the ancestry "components" on the x-axis
ggplot(data = adm_pop_data, aes(y = LABEL, x = avProp, fill = ANC)) +
  geom_bar(stat = "identity", width = 1, colour = NA) +
  scale_x_continuous(expand = c(0,0)) +
  scale_y_discrete(expand = c(0,0)) +
  scale_fill_brewer(palette = "Set2") +
  facet_grid(REGION~., scale = "free_y", space = "free_y") +
  labs(x = "Ancestry Proportion", y = "Population", title = paste0("ADMIXTURE Analysis K=",kval)) +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5),
        axis.text.y = element_text(size = 6, angle = 0, hjust = 1, vjust= 0.5),
        strip.text.y = element_text(angle = 0, hjust = 0, vjust = 0.5),
        strip.background = element_rect(fill = NA, colour = "black"),
        panel.background = element_blank(),
        panel.border = element_rect(fill = NA, colour = "black"),
        panel.grid = element_blank())

Things to do: Can you:

Runs of Homozygosity Analysis

We’ve had a look at a few ways to visual population structure in our west Eurasian sample. Now let’s have a look at some data that looks into the demographic history of a population. Firstly let’s investigate the levels of Runs of Homozygosity (ROH) in the populations.

ROH are contiguous sections of the genome at are homozygous in an individual. This is because that individuals’ parents passed on the same copy of the genome (hence the section is now homozygous in their offspring). The amount of ROHs and their length can tell us some things about the populations. As we will see…

We’re going to read in data which records every recorded ROH in all the west Eurasian individuals in our analysis. The code below then calculates the total ROH in each individual, and then calculates the averages for the populations. This data was generated in PLINK, using the --homozyg command (see scripts on the server for more details if you’re interested).

# read in the data, which is a table of each individual ROH detected as a row
roh_data <- paste0(datadir, "ho-westeurasia.roh-1500kb.hom")
roh_data <- as_tibble(read.table(file = roh_data, header = T))
roh_data
## # A tibble: 11,839 x 13
##    FID    IID      PHE   CHR SNP1  SNP2    POS1   POS2    KB  NSNP DENSITY  PHOM
##    <chr>  <chr>  <dbl> <int> <chr> <chr>  <int>  <int> <dbl> <int>   <dbl> <dbl>
##  1 French HGDP0~    -9     3 rs67~ rs28~ 4.86e7 5.02e7 1601.   109   14.7  1    
##  2 French HGDP0~    -9     6 rs10~ rs17~ 2.63e7 2.84e7 2032.   282    7.20 0.989
##  3 French HGDP0~    -9    10 rs12~ rs78~ 7.38e7 7.53e7 1501.   141   10.6  0.986
##  4 French HGDP0~    -9    11 rs74~ rs52~ 4.79e7 5.14e7 3500.    79   44.3  0.975
##  5 French HGDP0~    -9    11 rs25~ rs77~ 5.80e7 5.96e7 1621.    54   30.0  0.981
##  6 French HGDP0~    -9    12 rs71~ rs96~ 3.32e7 3.48e7 1540.   117   13.2  0.983
##  7 French HGDP0~    -9    16 rs71~ rs80~ 4.70e7 4.85e7 1552.    66   23.5  0.97 
##  8 French HGDP0~    -9    17 rs20~ rs98~ 5.75e7 5.94e7 1919.   124   15.5  0.992
##  9 French HGDP0~    -9     1 rs93~ rs46~ 9.93e6 1.51e7 5121.   880    5.82 0.994
## 10 French HGDP0~    -9     2 rs67~ rs36~ 4.60e7 4.90e7 2999.   651    4.61 0.997
## # ... with 11,829 more rows, and 1 more variable: PHET <dbl>
# now for each individual calculate the total number of ROH (NSEG) and the total length
# of ROH in megabases (TotMB).
roh_data <- roh_data %>%
  rename(LABEL = FID) %>%
  group_by(LABEL,IID) %>%
  summarise(NSEG = n(), TotMB = sum(KB)/1000, .groups = "drop")
roh_data
## # A tibble: 901 x 4
##    LABEL     IID        NSEG TotMB
##    <chr>     <chr>     <int> <dbl>
##  1 Abkhasian abh107       13 23.9 
##  2 Abkhasian abh119       15 30.1 
##  3 Abkhasian abh122       11 22.6 
##  4 Abkhasian abh133       16 34.8 
##  5 Abkhasian abh147       11 21.8 
##  6 Abkhasian abh154       15 27.9 
##  7 Abkhasian abh24         7 19.1 
##  8 Abkhasian abh27        10 24.8 
##  9 Abkhasian abh41         4  8.64
## 10 Adygei    HGDP01381    12 41.4 
## # ... with 891 more rows
# then we summarise by population group to find the average of NSEG and TotMb
roh_pop_data <- roh_data %>%
  group_by(LABEL) %>%
  summarise(avNSEG = mean(NSEG), avTotLen = mean(TotMB),
            SENSEG = sd(NSEG)/sqrt(n()), SETotLen = sd(TotMB)/sqrt(n()),
            .groups = "drop") %>%
  left_join(., select(pop_data,REGION,LABEL), by = "LABEL") %>%
  mutate(REGION = factor(REGION, levels = pop_data %>% distinct(REGION) %>% pull(REGION)),
         LABEL = factor(LABEL, levels = pop_data %>% distinct(LABEL) %>% pull(LABEL)))
roh_pop_data
## # A tibble: 64 x 6
##    LABEL      avNSEG avTotLen SENSEG SETotLen REGION     
##    <fct>       <dbl>    <dbl>  <dbl>    <dbl> <fct>      
##  1 Abkhasian    11.3     23.8  1.32      2.46 Caucus     
##  2 Adygei       10.8     27.3  0.905     3.88 Caucus     
##  3 Albanian     11.2     25.1  1.82      7.01 SE Europe  
##  4 Armenian      8.8     23.1  1.40      5.56 Near East  
##  5 Assyrian     14.1     76.0  1.69      8.39 Near East  
##  6 Balkar        9.9     21.9  1.23      4.05 Caucus     
##  7 Basque       18.4     46.5  0.996     3.33 S Europe   
##  8 BedouinA     18.7    120.   2.06     18.8  Middle East
##  9 BedouinB     33.3    149.   2.39     16.5  Middle East
## 10 Belarusian   11.5     22.7  0.749     1.75 E Europe   
## # ... with 54 more rows
# and plot the average amounts of ROH length in each population
ggplot(data = roh_pop_data,
       aes(x = avTotLen, y = LABEL, fill = REGION,
           xmin = avTotLen-SETotLen, xmax = avTotLen+SETotLen)) +
  geom_bar(stat = "identity", width = 1, colour = NA) +
  geom_errorbarh(height = 0.1, colour = "black") +
  scale_x_continuous(expand = c(0,0)) +
  scale_y_discrete(expand = c(0,0)) +
  facet_grid(REGION~., scale = "free_y", space = "free_y") +
  labs(x = "Average Length of ROH >1.5 (MB)", y = "Population") +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5),
        axis.text.y = element_text(size = 6, angle = 0, hjust = 1, vjust= 0.5),
        strip.text.y = element_text(angle = 0, hjust = 0, vjust = 0.5),
        strip.background = element_rect(fill = NA, colour = "black"),
        panel.background = element_blank(),
        panel.border = element_rect(fill = NA, colour = "black"),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_line(colour = "grey60", linetype = "dashed"),
        panel.grid.minor.x = element_blank())
## Warning: Removed 1 rows containing missing values (geom_errorbarh).

# now plot the *length* of ROH versus the *number* of ROH in each individuals
ggplot(data = roh_pop_data,
       aes(x = avTotLen, xmin = avTotLen-SETotLen, xmax = avTotLen+SETotLen,
           y = avNSEG, ymin = avNSEG-SENSEG, ymax = avNSEG+SENSEG,
           colour = LABEL, shape = LABEL)) +
  # include the standard errors
  geom_errorbar(width = 0, colour = "grey50") + geom_errorbarh(height = 0,  colour = "grey50") +
  geom_point() +
  scale_colour_manual(values = pull(pop_data, COLOUR), name = "Population") +
  scale_shape_manual(values = pull(pop_data, SHAPE), name = "Population") +
  guides(colour = guide_legend(ncol = 2), shape = guide_legend(ncol = 2)) +
  labs(x = "Mean Total Length IBD Segments (cM)", y = "Mean Total Number of IBD Segments") +
  theme(panel.background = element_blank(),
        panel.border = element_rect(fill = NA, colour = "black"),
        panel.grid.major = element_line(colour = "grey60", linetype = "dashed"),
        panel.grid.minor = element_line(colour = "grey60", linetype = "dotted"),
        legend.key = element_rect(fill = NA, colour = "grey60"))
## Warning: Removed 1 rows containing missing values (geom_errorbarh).

Things to do:

Identity-by-Descent Analysis

Now we’ve seen haplotype sharing as measured by ROH, let’s look at IBD segment sharing. Detecting these segments is more computationally involved as we need to phase the genotypes first, i.e. identify the paternal/maternal haplotypes in each individual. What we’ve done is phase each chromosome using Beagle and then used those phased haplotypes to detect IBD segments using refinedIBD. Again, the scripts that generated these are on the server if you are interested.

What we’re going to do is read the segments in, which are spearated into separate files by chromosome. I.e., IBD-segments in chr1 are in a different file from chr2-22.

# for loop over the IBD segment data and read in a list of dataframes
ibd_data <- list()
for (chrom in 1:22) {
  ibd_data[[chrom]] <- paste0(datadir, "ho-westeurasia.chr", chrom, ".refinedibd.len1.mergeibd.gz")
  ibd_data[[chrom]] <- read_delim(file = ibd_data[[chrom]], delim = "\t",
                                  col_names = c("ID1","HAP1","ID2","HAP2","CHROM","START","END","LOD","cM"),
                                  show_col_types = FALSE)
}
# combine this list into one concatenated dataframe
ibd_data <- do.call(rbind, ibd_data)
ibd_data
## # A tibble: 937,799 x 9
##    ID1        HAP1 ID2        HAP2 CHROM     START       END   LOD    cM
##    <chr>     <dbl> <chr>     <dbl> <dbl>     <dbl>     <dbl> <dbl> <dbl>
##  1 HGDP00511     1 HGDP00513     1     1  81775721  83084250  3.36  1.75
##  2 HGDP00511     2 HGDP00513     1     1  85967977  87400241  5.02  1.02
##  3 HGDP00511     1 HGDP00513     1     1 198859198 200147649  3.97  1.09
##  4 HGDP00511     1 HGDP00517     2     1  48050972  48986545  3.27  1.05
##  5 HGDP00511     1 HGDP00528     1     1   5088994   5572521  3.3   1.39
##  6 HGDP00511     2 HGDP00533     1     1  64925891  65647893  4.46  1.18
##  7 HGDP00511     1 HGDP00535     1     1 216249570 217520400  5.18  1.93
##  8 HGDP00511     1 HGDP00537     2     1 108297507 110346885  6.96  2.42
##  9 HGDP00511     1 HGDP00538     2     1 178813531 180193986  3.32  1.05
## 10 HGDP00511     2 HGDP00538     1     1 204871830 205487755  3.11  1.30
## # ... with 937,789 more rows
# filter for segments that are > 3cM in length (smaller segments have a high false-positive
# rate).
# Then for each individual pair (remember an IBD segment is something shared between 
# two individuals)
ibd_data <- ibd_data %>%
  filter(cM > 3) %>%
  group_by(ID1,ID2) %>%
  summarise(NSEG = n(), TotLen = sum(cM), .groups = "drop") %>%
  # also introduce our population labels
  left_join(., id_data, by = c("ID1" = "IID")) %>% rename(LABEL1 = LABEL) %>%
  left_join(., id_data, by = c("ID2" = "IID")) %>% rename(LABEL2 = LABEL)
ibd_data
## # A tibble: 20,755 x 6
##    ID1   ID2          NSEG TotLen LABEL1   LABEL2  
##    <chr> <chr>       <int>  <dbl> <chr>    <chr>   
##  1 A306  A325           13  58.7  Romanian Romanian
##  2 A306  A343            7  41.9  Romanian Romanian
##  3 A306  A362           12  70.9  Romanian Romanian
##  4 A306  A374           12  81.7  Romanian Romanian
##  5 A306  Assyrian160     1   3.18 Romanian Assyrian
##  6 A306  G408            2   7.69 Romanian Romanian
##  7 A306  G421            2   6.49 Romanian Romanian
##  8 A325  A343           20 122.   Romanian Romanian
##  9 A325  A362           20 149.   Romanian Romanian
## 10 A325  A374           18 115.   Romanian Romanian
## # ... with 20,745 more rows
# now we need to calculate the average IBD sharing patterns (mean total length of IBD
# and mean total number of segments) between individuals who are part of the same
# population label
ibd_data_summarised <- ibd_data %>% 
  filter(LABEL1 == LABEL2) %>%
  group_by(LABEL1,ID1) %>%
  summarise(avTotLen = mean(TotLen), avNSEG = mean(NSEG), .groups = "drop") %>%
  mutate(LABEL1 = factor(LABEL1, levels = pop_data %>% distinct(LABEL) %>% pull(LABEL)))

# and plot the average total length versus average total number of segments (like in ROH)
ggplot(data = ibd_data_summarised,
       aes(x = avTotLen, y = avNSEG, colour = LABEL1, shape = LABEL1)) +
  geom_point() +
  scale_colour_manual(values = pull(pop_data, COLOUR), name = "Population") +
  scale_shape_manual(values = pull(pop_data, SHAPE), name = "Population") +
  guides(colour = guide_legend(ncol = 2), shape = guide_legend(ncol = 2)) +
  labs(x = "Mean Total Length IBD Sehgments (cM)", y = "Mean Total Number of IBD Segments") +
  theme(panel.background = element_blank(),
        panel.border = element_rect(fill = NA, colour = "black"),
        panel.grid.major = element_line(colour = "grey60", linetype = "dashed"),
        panel.grid.minor = element_line(colour = "grey60", linetype = "dotted"),
        legend.key = element_rect(fill = NA, colour = "grey60"))

Things to do:

Genetic Distance Work

What we saw in the PCA section is the results of PLINK’s in-built implementation of PCA. This analysis is performed on a “genetic relationship matrix” (or GRM). This is a matrix of covariances between individuals from the SNP data, and is weighted by the variance of each SNP - one of the impacts of this is that rarer SNPs (who are often more informative) are up-weighted.

In PLINK this is created with the --make-rel command, which is what we’ve done. From this matrix of genetic “distances” we can run our own dimension reduction, or we can cluster individuals based on these genetic relationships.

# read in the relationship matrix
rel_data <- paste0(datadir, "ho-westeurasia.pruned-rel.rel")
rel_data <- as.matrix(data.table::fread(file = rel_data))

# read in the id file - which gives us the order of individuals within the matrix
rel_id <- paste0(datadir, "ho-westeurasia.pruned-rel.rel.id")
rel_id <- read_delim(rel_id, delim = "\t", col_names = c("POP","ID"), show_col_types = FALSE) %>%
  mutate(POP = factor(POP, levels = pop_data %>% distinct(LABEL) %>% pull(LABEL)))

# perform MDS analysis on the distance matrix calculated from the rel data
rel_mds <- cmdscale(d = dist(rel_data), k = 2)

# using mclust, cluster individuals based on their rel-matrix values
rel_BIC <- mclustBIC(rel_data)
rel_model1 <- Mclust(data = rel_data, X = rel_BIC)

# update the rel_id dataframe with the Class(ification) from mclust, and the 
# dimensions from MDS
rel_id$Class <- rel_model1$classification
rel_id$MDS1 <- rel_mds[,1]
rel_id$MDS2 <- rel_mds[,2]

# plot the  MDS dimesions by population label
ggplot(data = rel_id, aes(x = MDS1, y = MDS2, colour = POP, shape = POP)) +
  geom_point(size = 2) +
  scale_colour_manual(values = pull(pop_data, COLOUR), name = "Population") +
  scale_shape_manual(values = pull(pop_data, SHAPE), name = "Population") +
  guides(colour = guide_legend(ncol = 2), shape = guide_legend(ncol = 2)) +
  labs(x = "Principal Component 1", y = "Principal Component 2") +
  theme(panel.background = element_blank(),
        panel.border = element_rect(fill = NA, colour = "black"),
        panel.grid.major = element_line(colour = "grey60", linetype = "dashed"),
        panel.grid.minor = element_line(colour = "grey60", linetype = "dotted"),
        legend.key = element_rect(fill = NA, colour = "grey60"))

# plot the MDS dimensions by cluster
ggplot(data = rel_id, aes(x = MDS1, y = MDS2, colour = factor(Class))) +
  geom_point() +
  labs(x = "MDS Dim 1", y = "MDS Dim 2") +
  theme(panel.background = element_blank(),
        panel.border = element_rect(fill = NA, colour = "black"),
        panel.grid.major = element_line(colour = "grey60", linetype = "dashed"),
        panel.grid.minor = element_line(colour = "grey60", linetype = "dotted"),
        legend.key = element_rect(fill = NA, colour = "grey60"))

Things to Do:

Chromosome “Painting”

Now we’re going to look at a different form of genetic relationship matrix, this time measuring haplotype sharing through a process of “chromosome painting”. This process models each individual’s haplotypes as a mixture of donor/reference individuals and significantly outperforms traditional allelic based approaches because we’re using extra information in the form of the linkage between SNPs. If you’re interested further you can read the original paper or the first major application on genotypes in the UK.

For the ease of computation, we haven’t actually use “ChromoPainter”, the application descirbed in those above papers, but an approximation through the PBWT algorithm (an example of using PBWT-paint is here).

What we’ve done, is for every HO individual, we’ve modelled each of their autosomes as a mixture of the haplotypes of everyone else in the dataset - effectively “painting” them as a mixture using the “palette” of everyone else’s haplotypes (if that make more sense). We have 22 files, one for each autosome, of this summarised in a mixture of n by n individuals.

What we’re going to do is read these in, and sum all of the painting together and perform PCA on the summed matrix using some custom PCA functions. Then we’re going to see how different the results are to standard PCA.

# first we read in some custom functions that we're going to use that are designed to perform
# PCA on "chromosome painting" matrices.
mypcanorm<-function(x,zeromean=T,fdiag=T){
  ## Normalise for PCA by setting the diagonal to be the average of the rows, then zero meaning by substracting the row mean (so the diagonal ends up as zero)
  if(fdiag){
    diag(x)<-rowSums(x)/(dim(x)[1]-1)
  }
  if(zeromean){
    return(x-rowMeans(x))
  }else{
    return(x)
  }
}

mypca<-function(x,zeromean=T,fdiag=T){
  ## do PCA on x %*% t(x) , normalising first if desired
  xN<-mypcanorm(x,zeromean,fdiag)
  tcov<-xN %*% t(xN)
  eigen(tcov)   
}
# now we read in the id file, which contains the ids and the labels for the individuals
# *in the order* that they *appear* in the pbwt matrices
pbwt_id <- paste0(datadir, "ho-westeurasia.idfile")
pbwt_id <- read_delim(pbwt_id, delim = "\t", col_names = c("IID","POP","INCL"), show_col_types = F)
pbwt_id <- pbwt_id %>% select(IID,POP)
pbwt_id
## # A tibble: 901 x 2
##    IID       POP   
##    <chr>     <chr> 
##  1 HGDP00511 French
##  2 HGDP00512 French
##  3 HGDP00513 French
##  4 HGDP00514 French
##  5 HGDP00515 French
##  6 HGDP00516 French
##  7 HGDP00517 French
##  8 HGDP00518 French
##  9 HGDP00519 French
## 10 HGDP00522 French
## # ... with 891 more rows
# now we read in the first file, this will form the base of our summations
pbwt_data <- paste0(datadir, "ho-westeurasia.chr1.pbwt-paint.chunkcounts.out")
pbwt_data <- data.table::fread(pbwt_data)
pbwt_data <- pbwt_data %>% column_to_rownames("RECIPIENT")
pbwt_data <- as.matrix(pbwt_data)
pbwt_data[1:5,1:5]
##        IND1   IND2   IND3   IND4   IND5
## IND1 0.0000 0.8780 3.7619 1.4821 1.1587
## IND2 0.3047 3.7747 1.3232 0.6058 1.7135
## IND3 2.5978 1.2422 0.6722 0.5507 0.9802
## IND4 1.9150 0.5908 0.8839 1.3658 0.4825
## IND5 1.3394 1.0071 1.8376 0.4574 0.6905
# next we read in the results of each of the reamining autosomes and add those results
# to the first matrix, getting a genome-wide summary of the chromosome-painting
for (chr in 2:22) {
  tmp_data <- paste0(datadir, "ho-westeurasia.chr",chr,".pbwt-paint.chunkcounts.out")
  tmp_data <- data.table::fread(tmp_data)
  tmp_data <- tmp_data %>% column_to_rownames("RECIPIENT")
  tmp_data <- as.matrix(tmp_data)
  
  pbwt_data <- pbwt_data + tmp_data
}

rm(tmp_data, chr)

pbwt_data[1:5,1:5]
##         IND1    IND2    IND3    IND4    IND5
## IND1  5.2798 13.0034 13.6729 15.6848 12.3517
## IND2 15.4666 16.1712 11.5206 11.6625 13.7965
## IND3 15.6638 10.9284  4.1612  8.6203 13.8992
## IND4 16.0651 12.1530 11.5639 17.4203 12.7300
## IND5 14.6205 11.8383 15.4362 11.6149  7.9828
# then we set up the PCA...
# first we set the diagonal (same individual-painting) to 0
pbwt_data <- pbwt_data
diag(pbwt_data) <- 0

# next we perform PCA on the matrix, taking the frist 20 PCs
pbwt_vec_data <- mypca(pbwt_data)$vectors[,1:20]

# and then we add in the id and label information we read in in the beginning
pbwt_vec_data <- as_tibble(as.matrix(pbwt_vec_data)) %>%
  mutate(IID = pbwt_id %>% pull(IID),
         LABEL = pbwt_id %>% pull(POP)) %>%
  mutate(LABEL = factor(LABEL, levels = pop_data %>% distinct(LABEL) %>% pull(LABEL)))
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
colnames(pbwt_vec_data) <- c(paste0("PC",1:20), "IID","LABEL")

pbwt_vec_data
## # A tibble: 901 x 22
##        PC1     PC2       PC3     PC4      PC5      PC6     PC7      PC8      PC9
##      <dbl>   <dbl>     <dbl>   <dbl>    <dbl>    <dbl>   <dbl>    <dbl>    <dbl>
##  1 -0.0305 -0.0214  0.000503 0.00778  3.46e-3 -2.76e-4 0.00729  0.00355 -2.18e-3
##  2 -0.0251 -0.0186 -0.00500  0.00788  2.76e-3 -6.12e-3 0.00938 -0.00364  8.93e-4
##  3 -0.0271 -0.0180 -0.00237  0.00670  5.93e-3  1.13e-3 0.00654  0.00220 -5.07e-3
##  4 -0.0297 -0.0158  0.000749 0.0109  -1.69e-4 -2.90e-3 0.00460 -0.00478  2.64e-3
##  5 -0.0296 -0.0207  0.00127  0.00763 -1.42e-4  2.39e-3 0.00892 -0.00309  1.07e-3
##  6 -0.0249 -0.0138  0.00320  0.00713 -1.04e-3  7.13e-3 0.00559 -0.00364  6.31e-4
##  7 -0.0277 -0.0195 -0.00131  0.00969  1.30e-4  1.72e-3 0.00650 -0.00576 -3.50e-3
##  8 -0.0274 -0.0184 -0.00231  0.00787  2.03e-3  2.85e-3 0.00983 -0.00131 -1.28e-4
##  9 -0.0324 -0.0211  0.00172  0.00780 -8.29e-4  4.64e-4 0.00696 -0.00109 -4.04e-3
## 10 -0.0278 -0.0198 -0.00455  0.00558 -1.96e-4 -7.95e-4 0.0121  -0.00322 -4.67e-3
## # ... with 891 more rows, and 13 more variables: PC10 <dbl>, PC11 <dbl>,
## #   PC12 <dbl>, PC13 <dbl>, PC14 <dbl>, PC15 <dbl>, PC16 <dbl>, PC17 <dbl>,
## #   PC18 <dbl>, PC19 <dbl>, PC20 <dbl>, IID <chr>, LABEL <fct>
# and lastly we plot the first two PCs
ggplot(data = pbwt_vec_data, aes(x = PC1, y = PC2, colour = LABEL, shape = LABEL)) +
  geom_point(size = 2) +
  scale_colour_manual(values = pull(pop_data, COLOUR), name = "Population") +
  scale_shape_manual(values = pull(pop_data, SHAPE), name = "Population") +
  guides(colour = guide_legend(ncol = 2), shape = guide_legend(ncol = 2)) +
  labs(x = "Principal Component 1", y = "Principal Component 2") +
  theme(panel.background = element_blank(),
        panel.border = element_rect(fill = NA, colour = "black"),
        panel.grid.major = element_line(colour = "grey60", linetype = "dashed"),
        panel.grid.minor = element_line(colour = "grey60", linetype = "dotted"),
        legend.key = element_rect(fill = NA, colour = "grey60"))

Things to Do:

End Papers

We hope you’ve enjoyed this introduction to the sort of analyses we can do in populations genetics! There are plenty of analyses/programs/approaches we haven’t had time to go over here, but if you are interested the best way is to have a look at how different groups have investigated the population genetics of various groups of people (or other animals!).

If you have any questions about this workshop, or have any questions about the lectures or their speakers feel free to get in contact: